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In 2014, Steele, Lea, and Flood presented SprirM1x, an object-oriented pseudorandom number generator (PRNG) 
that is quite fast (9 64-bit arithmetic/logical operations per 64 bits generated) and also splittable. A conventional 
PRNG object provides a generate method that returns one pseudorandom value and updates the state of the 
PRNG; a splittable pRNG object also has a second operation, split, that replaces the original pRNG object with 
two (seemingly) independent pRNG objects, by creating and returning a new such object and updating the 
state of the original object. Splittable pRNG objects make it easy to organize the use of pseudorandom numbers 
in multithreaded programs structured using fork-join parallelism. This overall strategy still appears to be 
sound, but the specific arithmetic calculation used for generate in the SpLITMix algorithm has some detectable 
weaknesses, and the period of any one generator is limited to 2°. 

Here we present the LXM family of prnc algorithms. The idea is an old one: combine the outputs of two 
independent prNG algorithms, then (optionally) feed the result to a mixing function. An LXM algorithm uses 
a linear congruential subgenerator and an F2-linear subgenerator; the examples studied in this paper use a 
linear congruential generator (LCG) of period 21°, 2°”, 264, or 2128 with one of the multipliers recommended 
by L’Ecuyer or by Steele and Vigna, and an F2-linear xor-based generator (XBG) of the xoshiro family or 
xoroshiro family as described by Blackman and Vigna. For mixing functions we study the MurmurHash3 
finalizer function; variants by David Stafford, Doug Lea, and degski; and the null (identity) mixing function. 

Like SptiTMrx, LXM provides both a generate operation and a split operation. Also like SpLrrM1x, LXM 
requires no locking or other synchronization (other than the usual memory fence after instance initialization), 
and is suitable for use with simp instruction sets because it has no branches or loops. 

We analyze the period and equidistribution properties of LXM generators, and present the results of 
thorough testing of specific members of this family, using the TestU01 and PractRand test suites, not only on 
single instances of the algorithm but also for collections of instances, used in parallel, ranging in size from 2 to 
2°4 | Single instances of LXM that include a strong mixing function appear to have no major weaknesses, and 
LXM is significantly more robust than SpLirMix against accidental correlation in a multithreaded setting. We 
believe that LXM, like SpLrirM1x, is suitable for “everyday” scientific and machine-learning applications (but 
not cryptographic applications), especially when concurrent threads or distributed processes are involved. 
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1 INTRODUCTION 


Many algorithms used for physical simulations and machine learning, ranging from Monte Carlo 
methods to stochastic gradient descent, consume sequences of values that are either chosen “truly 
at random” (by some sufficiently unbiased physical process) or generated by a deterministic 
computation chosen in hopes that the computed values will “appear to be random,’ at least for the 
purposes of the specific application. Such computed sequences of values are called pseudorandom. 
Any one who considers arithmetical methods of producing random digits is, of course, in a state of 
sin. For, as has been pointed out several times, there is no such thing as a random number—there 
are only methods to produce random numbers, and a strict arithmetic procedure is of course not 
such a method. [von Neumann 1951] 
Von Neumann made the preceding remark as part of his analysis of his middle-square method: 
In obtaining y as the middle ten [decimal] digits in the square of a ten-digit number x, we are really 
mapping x onto y by a certain sawtoothed discontinuous curve y = f(x), forO <x <1,0<y<1. 
When we take xi+1 = f(x;) fori = 1, 2,3,..., this curve will gradually scramble the digits of x; and 
produce something fairly pseudo-random. [von Neumann 1951] 
How can we know whether computed sequences “appear to be random”? By testing them: 
A pseudo-random sequence is a vague notion embodying the idea of a sequence in which each term 
is unpredictable to the uninitiated and whose digits pass a certain number of tests traditional with 
statisticians and depending somewhat on the uses to which the sequence is to be put. [Lehmer 1951] 
But what tests should be used? 
We are here dealing with mere “cooking recipes” for making digits; probably they cannot be justified, 
but should merely be judged by their results. Some statistical study of the digits generated by a 
given recipe should be made, but exhaustive tests are impractical. If the digits work well on one 
problem, they seem usually to be successful with others of the same type. [von Neumann 1951] 
Indeed, the very idea of testing a sequence to see whether it is “random” is somewhat paradoxical: 
. we would like to point out that all tests have a subjective element and that no test or sequence of 
such will establish a sequence of digits as being random.... if digits were generated by a [truly] 
random process then any method that rejected any sequence would be faulty. | [Hammer 1951] 
Nevertheless, Forsythe did apply statistical tests to three variations of von Neumann’s middle-square 
method; two failed quickly, and the third was then subjected to more rigorous tests: 
The null hypothesis was formulated that, insofar as ordinary statistical tests can detect, method C is 
a process for generating independent, equidistributed digits. To test the null hypothesis, four x” tests 
devised by Kendall and Babington Smith were applied [to a sequence of 100,000 generated digits]: 
(a) frequency test: frequency of 0, 1, 2,...,9 in first 50,000 digits; 
(b) serial test: frequency of 00, 01,...,99 in first 50,000 digits; 
(c) gap test: frequency of lengths of gaps between successive zero digits in entire sample; 
(d) poker test: frequency of combinations aaaa, aaab, aabb, aabc, abcd in all 25,000 groups of four 
consecutive digits. [Forsythe 1951] 
But while “method C” passed three of the tests, it failed test (b): 
In summary, method C is not recommended for the generation of random digits, because the 
distribution of pairs of digits appears to be variable and not uniform. [Forsythe 1951] 
The evolution of pseudorandom number generator (PRNG) algorithms over the last seven decades 
may be fairly characterized as a competition between generators and tests, driven in part by 
algorithmic improvements and in part by Moore’s Law. Ideally, each new value should be generated 
in polynomial time, but the generator output should not be predictable in polynomial time [Blum 
and Micali 1984]; there are generators satisfying this definition, but they are computationally 
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expensive and they depend presently on hardness assumptions [Blum et al. 1986]. In practice one 
often trades quality for speed; statistical testing helps to avoid trading away too much quality. 

There are also indirect methods for judging the quality of a pRNG algorithm, involving mathe- 
matical analysis of the code rather than running the code and testing the output. One of the earliest 
and most successful is the spectral test [Coveyou and Macpherson 1967; see also Knuth 1998, §3.3.4], 
which is applicable to what we now call linear congruential generators [Lehmer 1951; Rotenberg 
1960; Thomson 1958], in which the state of the generator is advanced by performing integer multi- 
plication by a fixed constant and then optionally adding another fixed constant. Marsaglia [1968] 
observed that all purely multiplicative congruential pRNG algorithms have unexpected biases when 
used to generate n-tuples of values; the spectral test quantified this bias, allowing tabulation of 
multipliers that minimize this bias [L’Ecuyer 1999b; Steele and Vigna 2021]. These results generalize 
to all linear congruential generators. 

Some generators, instead, treat their state as a vector of values in the field F, (multiplication 
is matrix-vector multiplication, often optimized by choosing the constant matrix so as to allow 
implementation using only a small number of sHirT and xor instructions), and these are called 
F,-linear generators. These have their own rich history, including Linear Feedback Shift Register 
techniques [Golomb 2006, 2017; Goresky and Klapper 2012; Wolfram 2016] and Marsaglia’s “xorshift 
generators” [Brent 2004; Marsaglia 2003]. More recent examples are the xoroshiro and xoshiro 
algorithms [Blackman and Vigna 2018; Vigna 2014-2021], which we build upon in this paper. Briefly 
put, they provide a systematic way to derive good F,-linear functions that can be computed cheaply 
using only xor, SHIFT, and ROTATE instructions; some versions use a “scrambler” (a fast output 
function, weaker than a full-blown hash function) to compensate for specific small flaws. 

The appeal of linear generators of either kind is that they are much more amenable to mathe- 
matical analysis. This is a double-edged sword: it is easier to prove certain favorable properties but 
also easier to expose deficiencies. 

Researchers began to explore compound generators. MacLaren and Marsaglia [1965] used one 
linear generator to “shuffle” the sequence produced by another. A more recent and widely used 
example is MRG32k3a [L’Ecuyer 1999a], which additively combined the outputs of two generators 
that have the same algorithmic structure but different moduli. There was also limited exploration 
of compounds whose component generators are of different kinds [L’Ecuyer and Granger-Piché 
2003], for example one using integer multiplication and one using bit-matrix multiplication. 

Congruential generators are liked, but not well-liked; ... combination generators seem best—if the 
numbers are not random, they are at least higgledy piggledy. [Marsaglia 1985] 
In the meantime, extensive test suites were developed, starting in the 1990s: Diehard [Marsaglia 
1995; Marsaglia and Tsang 2002], TestU01 [L’Ecuyer and Simard 2007, 2013; Simard 2009], Dieharder 
[Brown et al. 2003-2006], and the NIST Statistical Test Suite [Rukhin et al. 2001, 2010]. (L’Ecuyer 
has recently provided a detailed history of RNG algorithms and testing procedures [L’Ecuyer 2017].) 

With the emergence of multiprocessor (and later, multicore) computer architectures, users needed 
PRNG algorithms for multithreaded computations. If you simply use one copy of a specific algorithm 
on each processor, there is a risk that the generated sequences may be statistically correlated in 
some way. One approach is to use an algorithm that can generate an extremely long sequence, then 
make sure that each processor uses a different subsequence of that long sequence. This led to the 
idea of a jump function that can quickly compute some far-distant future element of the sequence. 
(One strength of linear generators is that it is relatively easy to construct such jump functions.) 

Another approach to parallel prRNG construction is to break the problem into two parts: first, 
make sure that the processors generate sequences that are distinct but not necessarily “random? 
then hash each generated value. An extreme variation (PHILOX) is to give every processor a 
counter, initialize the counters to widely spaced integers (the “jump function” is trivial), then 
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apply a (perhaps cryptographically strong) hash function to the successive integers produced 
by each counter [Salmon et al. 2011]. Another approach (DoTMrx) is to maintain a “pedigree” 
vector that is guaranteed to be different for every pseudorandom value to be generated (the vector 
is extended every time a parallel computation is forked, and the last entry is incremented as 
necessary); pseudorandom values are produced by hashing the vector [Leiserson et al. 2012]. The 
SPLITMix algorithm, which began as an attempt to optimize the performance of DoTMrx, uses 
Weyl] generators (counters modulo 2”, where each increment is an odd number and the increments 
are likely different for each processor) and then applies a hash function that is only moderately 
strong but much faster [Steele, Lea, and Flood 2014]; SptrrMrx had a known weakness that certain 
unfortunate choices for the increment might result in sequences that fail the test suites, but the 
designers described and implemented a technique for avoiding these poor choices in practice.' The 
SPLITMix algorithm was implemented as class SplittableRandom [Oracle Corporation 2014b] in 
the library for the Java® programming language as part of Java Development Kit 8 (JDK8). 

PHILOX was tested using TestU01; DoTMrx was tested using Dieharder; SpLirM1x was tested 
using both. In the last decade another test suite, PractRand [Doty-Humphrey 2011-2021], has 
emerged, that has largely superseded Dieharder and is especially useful as a complement to TestU01 
because PractRand is extremely fast, has the ability to “fail early,” and can be run as long (or as 
short) as desired, producing ever more precise results as it goes. 

I think nobody who is practically concerned will want to use a sequence produced by any method 

without testing it statistically, and it has been the uniform experience with those sequences that it 
is more trouble to test them than to manufacture them. Hence the degree of complication of the 
method by which you make them is not terribly important; what is important is to carry out a 
relatively quick and efficient test. [von Neumann 1951] 
It is now standard practice to use PractRand to test at least 4 terabytes (perhaps even 16 terabytes) 
of PRNG output. An algorithm that passes that test may then be given to TestU01 for final testing. 

Computers are now fast enough, and have enough main memory, that it is feasible to perform 
extremely sensitive collision tests [Knuth 1998, §3.3.2.I] that use billions of bins. 

In this paper, we return to the idea of constructing compound generators by adding outputs 
from two simple generators of different types: a linear congruential generator and a xo(ro)shiro 
generator. It turns out that PractRand can detect flaws in the low bits of these outputs, so we 
furthermore use a mixing function of the same kind used by SpiirMrx. The result is the LXM family 
of algorithms presented here, which we have tested using both PractRand and TestU01. 

The LXM algorithm is a fairly simple idea that combines building blocks already in the literature 
in ways already studied in the literature—yet this precise combination seems not to have been 


1The Weyl generator used in SPLITMrx is, in effect, a trivial linear congruential generator that “multiplies by 1” and then 
adds a constant y; the generated sequence is x, = (ky + x9) mod 2™, which is not at all random. The quality of SpLirMrx 
output depends almost entirely on the quality of its mixing function, which can turn that linear sequence into a reasonably 
random-looking sequence unless y is so poorly chosen that consecutive values x; and x;4; are the same in almost all 
bit positions—in that case, the mixing function isn’t quite good enough to map consecutive Weyl-generator outputs to 
sufficiently different final values. This can happen in two different ways: (a) y is almost all 0-bits or almost all 1-bits—then 
adding y to x; doesn’t change very many bits; (b) the low k bits of y and the low k bits of y >>> k are very similar, where 
k is the shift distance used in the first step x *= (x >>> k); of the mixing function—in that case, adding y to x, will tend 
to change two groups of bits in much the same way, and then the sHIFT-xor step of the mixing function will tend to cancel 
that change in the k low-order bits. The remaining steps of the mixing function are not good enough to compensate for this 
effective “lessening of randomness” in the first step. A further wrinkle is that such weaknesses (of either type, (a) or (b)) can 
also occur when y itself does not have either of these properties but some small integer multiple of y does; for example, 
if 3y is almost all 0-bits, then for every k, xz and xx43 will likely be so similar that the mixing function can’t map them 
to sufficiently different final values. The designers of SplitMix anticipated case (a), but not case (b) or the further wrinkle 
about small multiples. It is possible to detect and avoid these other cases, but that would make the code more complicated. 
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previously studied systematically or put into widespread practice. The principal contributions of 
this paper are explaining why specific components were chosen and why they were combined in a 
specific way, analyzing certain properties of the combination, comparing this structure to prior 
work, and empirically probing for weaknesses through detailed quality tests and timing tests. 

Section 2 describes the structure of the LXM algorithm in pragmatic terms and presents Java code 
for two versions. Section 3 explains how the split operation is performed for LXM. Section 4 defines 
special notations and terminology used in this paper. Section 5 presents a more mathematical 
description of the LXM algorithm, and Section 6 discusses properties of the algorithm, such as 
period and equidistribution. Section 7 presents results of testing for statistical quality; Section 8 
presents timing tests for both LXM and SpiiTMix. Section 9 goes into more detail about how to 
split and jump LXM generators. Section 10 provides advice on choosing a PRNG algorithm for a 
specific application. Related work is cited in Section 11; conclusions are in Section 12. 


2 THE LXM GENERATION ALGORITHM 


A member of the LXM family of algorithms for word size w (where w is any non-negative integer, 
but typically either 64 or 32) consists of four components: 


e L: a linear congruential pseudorandom number generator (LCG) with a k-bit state s, k > w 

e X: an F,-linear [L’Ecuyer and Panneton 2009] pseudorandom number generator (we use the 
term XBG, for “xor-based generator”) with an n-bit state x,n > w 

e a simple combining operation on two w-bit operands that produces a w-bit result 

e M: a bijective mixing function that maps a w-bit argument to a w-bit result 


The combining operation should have the property that if either argument is held constant, the 
resulting one-argument function is bijective; typically it is either binary integer addition ‘+’ or 
bitwise xor ‘@’ on w-bit words. In most practical applications k and n are integer multiples of w. 

The generate operation for an LXM generator is described by the following pseudocode, where 
global variable s is the LCG state (a non-negative integer), global variable t is the XBG state (a bit 
vector), multiplier m is an integer such that (m mod 8) = 5, additive constant a is an odd integer, 
and update matrix U is an n X n matrix of bits. Elements of the product of matrix U and a bit vector 
of length n are computed in the two-element field F, (addition is xor). In practice, U is chosen so 
that such matrix-vector products can be computed by using a small number of instructions such as 
XOR, SHIFT, and ROTATE operating on w-bit words. 

generate() : 
z <— mix(combine(w high-order bits of s, w bits of t)) 
s — LCG_update(s) 
t — XBG_update(t) 
return Zz 
LCG_update(s) : return (ms + a) mod 2* 
XBG_update(t) : return Ut 
This pseudocode uses the standard trick of using the old state values of the subgenerators to 
compute the result to be returned; this allows the state updates for the two subgenerators to be 
overlapped or interleaved not only with each other but with the computation of the combining and 
mixing functions, which may be advantageous on processors that can execute multiple instructions 
concurrently. 

Figure 1 shows a specific implementation in the Java programming language of the generate 
operation for w = 64,k = 64,m = 128. The period of the LCG is 2°*. The XBG is xoroshiro128 
version 1.0 [Blackman and Vigna 2018], which has a period of 2!*8 — 1. The combining function is 
binary addition. The mixing function is a variant of the MurmurHash3 mixing function [Appleby 
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private static final long M = 0xd1342543de82ef95L; // Fixed multiplier 
private final long a; // Per-instance additive parameter (must be odd) 
private long s, x@, x1; // Per-instance state (x@ and x1 are never both zero) 


public long nextLong() { 
// Combining operation 
long z = s + x; 
// Mixing function (lea64) 
Z = (z * (z >>> 32)) * Oxdaba@b6eb09322e3L; 
Z= (z * (z >>> 32)) * Oxdaba@b6eb09322e3L; 
Z= (z * (z >>> 32)); 
// Update the LCG subgenerator 
s=Mx Sta; 
// Update the XBG subgenerator (xoroshiro128v1_0) 
long qQ@ = x@, ql = x1; 
ql “= qQ; 
q@ = Long.rotateLeft(q@, 24); 
qa = q@ * ql * (ql << 16); 
qi = Long.rotateLeft(ql, 37); 
x@ = qQ; x1 = ql; 
// Return result 
return Z; 


Fig. 1. Java code for the generate operation of an LXM generator with period 24 (2128 — 1) 


2011, 2016] identified by Doug Lea. The additive parameter a may be initialized to any odd integer, 
and the state variables s, x, and x1 may be initialized to any values as long as x@ and x1 are not 
both zero. Because the periods of the subgenerators are relatively prime, the overall period of this 
LXM generator is 2°4(2!78 — 1) = 219? — 264, 

Figure 2 shows a second specific implementation, this time for w = 64, k = 128, m = 256. It uses 
the same 64-bit mixing function but uses a different (256-bit) XBG, xoshiro256 [Blackman and 
Vigna 2018]. It also illustrates some interesting engineering tradeoffs when implementing a 128-bit 
LCG using 64-bit arithmetic. Computing the (128-bit) low half of two 128-bit operands requires 
computing the 128-bit product of the (64-bit) low halves, plus the (64-bit) low halves of products of 
two pairs of 64-bit values, each consisting of the high half one of 128-bit operand and the low half 
of the other. But testing seems to show that there is little extra benefit of using a 128-bit multiplier 
over a 65-bit multiplier; on the other hand, theory tells us that a 64-bit multiplier will produce an 
LCG of lower quality [Steele and Vigna 2021]. Therefore we choose to use a multiplier of the form 
2°4 +m where m < 2° and of course (m mod 8) = 5; this eliminates one 64-bit multiplication in the 
implementation. On the other hand, there is a benefit to be gained by using a full 128-bit additive 
parameter rather than settling for 64 bits. The code uses two long values ah and al to represent 
the high and low halves of the additive parameter, and similarly uses two long values sh and s1 to 
represent the high and low halves of the LCG state. (Because Java has not yet implemented the 
method Math. unsignedMultiplyHigh, code for this operation is included in Figure 2, using the 
technique described in Hacker’s Delight [Warren 2012, §8.3, p. 175].) 

These implementations, and some others, are scheduled to be incorporated into a new package 
java.util. random as part of JDK17. This package will also include a new API intended to better 
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private static final long ML = @xd6@5bbb58c8abbfdL; // Low half of fixed multiplier 
private final long ah, al; // Per-instance additive parameter (al must be odd) 
private long sh, sl, x®, x1, x2, x3; // Per-instance state (x, x1, x2, x3 not all @) 


private long unsignedMultiplyHigh(long a, long b) { 
return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a); 
} 


public long nextLong() { 
// Combining operation 
long z = sh + xQ; 
// Mixing function (lea64) 
Z = (z * (z >>> 32)) * Oxdaba@b6eb09322e3L; 
Z = (z * (z >>> 32)) * Oxdaba@b6eb09322e3L; 
Z= (z * (z >>> 32)); 
// Update the LCG subgenerator 
// The LCG is, in effect, "s =m * s + a" where m = ((1LL << 64) + ML) 
final long u = ML * s1; 


sh = (ML * sh) + unsignedMultiplyHigh(ML, sl) + sl + ah; // High half 
sl =u+al; // Low half 
if (Long.compareUnsigned(sl, u) < @) ++sh; // Carry propagation 


// Update the XBG subgenerator (xoshiro256 1.0) 
long qQ = x@, ql = x1, q2 = x2, q3 = x3; 

long t = ql << 17; 

q2 *= q@; q3 *= q1; ql *= q2; qQ “= q3; q2 *= t; 
q3 = Long.rotateLeft(q3, 45); 

x@ = q@; x1 = ql; x2 = q2; x3 = q3; 

// Return result 

return Z; 


Fig. 2. Java code for the generate operation of an LXM generator with period 2128 (27° — 1) 


support interchangeable use of various PRNG algorithms within an application. The centerpiece 
is a new interface RandomGenerator, which provides default implementations for many standard 
methods such as nextFloat(), nextDouble(), nextGaussian(), ints(), and longs(), provided 
only that any class that implements the interface must provide a method period (for reporting 
the length of the state cycle) and either a nextLong() method (for generating a pseudorandomly 
chosen 64-bit integer) or a nextInt() method (for generating a pseudorandomly chosen 32-bit 
integer). Other new interfaces support the possibility that a specific pRNG algorithm may provide 
a jump() method (for advancing a large distance along the state cycle) or a split() method (for 
creating a new generator from an existing one, as described by Steele, Lea, and Flood [2014)]). 


3. LXM IMPLEMENTATION OF SPLITTING 


The motivation for the split operation is that each distinct instance of a splittable pRNG, once 
created, accesses only its own internal state. Instances of LXM do not communicate and do not 
share any state; therefore no synchronization is required. Methods such as longs that can generate 
streams of values in parallel rely on the standard Java spliterator mechanism [Oracle 2014]. The point 
of a spliterator is that it can (try to) split itself into two substreams that may then be processed in 
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public SplittableGenerator split() { 
return new L64X128MixRandom(this.nextLong(), this.nextLong(), 
this.nextLong(), this.nextLong()); 


Fig. 3. Java code for the split operation of an LXM generator with period 2°4(2!28 — 1) 


parallel. The spliterator for a stream that generates values from a splittable pRNG also automatically 
splits the pRNG so that each substream will have its own PRNG instance rather than trying to share 
the same one. It is this step that eliminates the need for any locking or other synchronization in 
the parallel execution of stream expressions that use a splittable pRNG. This strategy is exactly that 
already used for class SplittableRandom [Steele, Lea, and Flood 2014, §2.2]. Here we describe the 
LXM implementation of the split method and the new splits method. 


3.1. The Split Operation 


Creating a new instance of an LXM algorithm from an existing one is done in a straightforward 
way: the nextLong() or nextInt() method of the existing one is used to generate values for the 
state variables of the LCG and XBG subgenerators and for the additive parameter of the LCG. See 
Figure 3. The constructor then forces the additive parameter to be odd by setting its low-order 
bit to 1, but beyond that no additional vetting of the additive parameter (to reject “weak values” 
[Steele, Lea, and Flood 2014]) is necessary. In the unlikely circumstance that the state for the XBG 
subgenerator is entirely 0, it is necessary to force it to be nonzero; this could be done by making 
additional calls to nextLong() or nextInt(), but it is easier (and acceptable in practice, because it 
happens so rarely) to derive the new XBG state from the new LCG state. 


3.2 The Splits Operation 


Existing JDK prNc implementations, such as classes Random and SplittableRandom [Oracle Cor- 
poration 2014a,b], provide methods such as ints(), longs(), and doubles() that produce streams 
of pseudorandomly chosen values. JDK17 introduces a new method rngs() that produces a stream 
of PRNG instances; one can then use the map method of the stream to execute a piece of code many 
times, perhaps in parallel, each with its own PRNG instance so that there is no competition for 
a shared resource (such as a single, shared PRNG). PRNG algorithms that have a jump() method 
may also provide a jumps() method that is then automatically used to implement the rngs() 
method by jumping along the state cycle multiple times. On the other hand, pRNcG algorithms that 
have a split() method may also provide a splits() method that is then automatically used to 
implement the rngs() method by using the split() method multiple times—but with a bit of 
cleverness. The details of the technique are outside the scope of this paper, which focuses on how 
values are generated, and why; but we touch on it briefly in Section 9. 


4 NOTATION AND TERMINOLOGY 


We use the standard lambda notation Ax.e to denote a function that takes one argument and returns 
the value produced by the expression e with the parameter x bound to the given argument. If the 
argument is expected to be a tuple, we use a “nested destructuring parameter binding” notation; for 
example, if the argument is expected to be a 2-tuple containing a number and a 3-tuple, we could 
use a notation such as A(n, (x, y, z)).e. In this paper we usually choose to use Greek letters such as 
o and r to name parameters. 
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We work with vectors and matrices whose elements are taken from the two-element field F, 
(also known as GF(2) and Z/2Z). We casually refer to the elements of such vectors and matrices as 
bits, and we use both the symbol @ and the name xor to refer to addition within this field. We refer 
to elements and subvectors of a bit vector v by using brackets with 0-origin indexing, for example 
v[i] or o[i.. j]; “[i.. j]” (where i < j) denotes subscripting by a range of integers i to j, inclusive. 
Where necessary, we assume that any integer j in the range [0..2™) (inclusive of 0, exclusive of 2) 
may be implicitly treated as a bit vector v of length w, and vice versa, by satisfying the relationship 
j= BIE v[i]2' (where o[i] is implicitly converted to an integer 0 or 1 before multiplying by 2’). 

Let S and T be finite sets of values; we will also refer to S and T as types, in the sense that the 
value of a variable of type S must be an element of S, and similarly for T. 

For our purposes, a PRNG with states of type S and outputs of type T is a triple (so, f, g) where 
so € Sis the initial state, f : S — S is a bijective function on states, and g : S — T is a function 
from states to outputs. Such a generator produces a sequence of states so, $1, S2,... defined by the 
recurrence s; = f(s;-1) for all i > 0; it also produces a sequence of outputs to, ty, t2,... such that for 
all i > 0, t; = g(s;). Thus for all i > 0, t; = g(f'(s0)). 

Because S is finite, these sequences are periodic; because f is bijective, the sequence does not 
have a nonempty initial subsequence before commencing the periodic behavior. The period of the 
generator is the smallest P > 0 for which sp = 59; it follows that for all nonnegative integers i and k, 
Si+kp = 5; (and therefore tj4,p = ti). We sometimes refer to the finite cyclic sequence So, 51,..., Sp-1 
as the state cycle of the generator; the length of this cycle is the period P. 

We use V to refer to the bag (multiset) of outputs generated during one period of the generator, 
that is, V = 1 t; | 0 < i < PJ. We sometimes regard this multiset as a function V : T > N that 
maps each element of T to the number of times that value occurs in the multiset; in other words, it 
is the number of times that that value appears within any length-P subsequence of the sequence of 
outputs. The size of the multiset V, written |V]|, is defined to be >),<7 V(v); it follows that |V| = P. 

Sometimes a PRNG with outputs of type T is regarded as a PRNG with outputs of type T/ for some 
j > 0—that is, as generating tuples of length j, where each element of the tuple is of type T. If the 
underlying pRNG of type T is the triple (so, f, g), then the alternate view may be described by the 
derived triple (((to, tiers thsi) Sj); A(t, Try: sie Tj=1) 5 oj-1).((t1, ies 7-1, 9(f (oj-1))), f(oj-1)), 
A((t, T,--+5Tj-1)s oj-1) (15 Ty «5 Ti) In other words, the generated tuples are the (overlapping) 
length-j subsequences of the output sequence of the underlying prNe. Note that the prne of tuples 
has the same period as the underlying PRNG. 

In some prior literature, a PRNG with outputs of type T is described as “equidistributed” if the 
multiset V of values generated during each period has the property that for any two values x and y 
of type T, |V(x) — V(y)| < 1; that is, the generated values are distributed “as equally as possible” 
over the values of type T. More generally, a PRNG is described as “j-dimensionally equidistributed” 
if it is equidistributed when regarded as a generator of j-tuples as described above. Note that being 
1-dimensionally equidistributed is the same as being equidistributed. 


We introduce here a somewhat more detailed terminology: we say that a pRNG that generates 
values of type T is 6-distributed if for any two values x and y of type T, |V(x) — V(y)| < 6 lire. 
(Omitting the ceiling brackets would make this definition slightly tighter, but including them allows 
amore concise form for the 6 values that is more convenient in practice for purposes of comparison.) 
Since smaller values of 6 are better, we will normally in each case cite the smallest possible value 
of 6, and if 6 = 0, we will say that the PRNG is exactly equidistributed. More generally, we will say a 
PRNG is j-dimensionally 6-distributed if it is d-distributed when regarded as a generator of j-tuples; 


but if 6 = 0, we will say that the pRNG is exactly j-dimensionally equidistributed. 
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5 THEORETICAL CONSTRUCTION OF THE LXM ALGORITHM 


We define an LCG with state size k such that k > 3, multiplier m such that (m mod 8) = 5, additive 
parameter a such that 1 <a < 2* and a is odd, initial state So such that 0 < sy < 2k and output 
size w such that 0 < w < k, as the triple L = (50, Ao.(mo +a) mod 2k Ac. laje’ |), and we write 
to, ti, t2,... to refer to its outputs. 

We define an XBG with state size n, n-by-n bit matrix U, initial state x9 where xo is an n-bit 
vector, and output size w such that 0 < w < nas the triple X = (xo, At.Ut, At.t[0..w — 1]), and we 
write Yo, Yi, Y2,... to refer to its outputs, where 7[0..w — 1] produces a w-bit vector containing 
the first w bits of r. (We use the first w bits of r without loss of generality, because one can create 
an equivalent XBG that delivers any desired size-w subset of the state bits, in any order, by using 
some single fixed permutation to reorder the bits of the initial state and also to reorder both the 
rows and columns of the matrix U.) 

Given such an LCG and XBG, a binary combining operation on w-bit values ® (which is typically 
either + or @), and a bijective mixing function j: on w-bit values, an LXM generator is the triple 
G = ((50, Xo), A(o, r).((mo + a) mod 2k Ur), A(o, T).L( jaa" | ® t[0..w—1])). It is easy to see 
that the set of possible states of the LXM is the cross product of the sets of states of the LCM and 
XBG; that the state-update function for the LXM simply pairs an update of the LCG with an update 
of the XBG; and that the output function combines an output of the LCG with a corresponding 
output of the XBG and then applies the mixing function. 

The reader may wonder, given that the state update function of the LCG uses an affine transfor- 
mation mo + a, why the state update function of the XBG does not more generally use an affine 
transformation Ur © v. The answer has more to do with engineering than theory; we address it 
below in Sections 6.5.2 and 6.5.3. 


6 PROPERTIES OF THE LXM ALGORITHM 


In this section we discuss some properties of the LXM algorithm and how they derive from properties 
of its components. First we provide brief answers to some obvious questions; the subsections that 
follow elaborate on these answers. 

Why use two subgenerators? The usual reasons: each is fairly small and fast, and they are chosen 
so that the period of the LXM generator will be the product of their individual periods. 

Why use an XBG for one subgenerator? XGBs are fast; they are already widely used to produce 
pseudorandom sequences of fairly good quality; they have a well-understood theory, including for 
which k they are k-dimensionally equidistributed; and it is easy to scale their state size. 

Why use an LCG? An LCG whose period is a power of 2 provides exact equidistribution, and 
preserves any k-dimensional equidistribution contributed by the XBG. An LCG is fairly fast, and 
uses hardware resources (multiply and add) that may be different from those needed by the XBG. 
The LCG provides an easy way to provide an additive parameter. Finally, mixing two generators 
based on different algebraic operations may improve the quality of a PRNG. 

Why have an additive parameter? Additive parameters are an alternative to using jump functions 
to ensure non-overlap of multiple sequences, but are faster, easier to use, and easier to code. 

Why use a nonlinear mixing function? The graph of every LCG with the same multiplier m has the 
same shape, even if they have different additive parameters. A similar remark is true of a generalized 
form of XBG. Changing the parameter just shifts (and perhaps flips) the graph. Therefore the graph 
of the combined LCG/XBG part of LXM also always has the same shape (more precisely, one of two 
shapes). A good mixing function reacts nonlinearly to the additive parameter (as well as to more 
subtle linear correlations within the subgenerators). Testing confirms that a good mixing function 
appears to make different streams relatively uncorrelated, but we don’t have a theoretical proof. 
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6.1. Period 


A well-known fact about LCGs of period 2* is that for all 0 < j < k, the sequence of bits consisting 
of successive values of bit j of the overall state (where the least significant bit is bit 0 and the most 
significant bit is bit k — 1) has period 2/*!. Therefore the most significant bit has period 2*. It follows 
trivially that the sequence of w-bit values consisting of successive values of bits k — 1 through 
k — w of the overall state has period 2*. 

The XBG subgenerator of an LXM algorithm is always chosen so that the sequence of w-bit 
values consisting of successive values of a specific set of w bits within the n bits of state has an odd 
period P. Because any odd number is relatively prime to any power of 2, the overall period of an 
LXM generator will be 2*P. Note that the various xoroshiro and xoshiro algorithms each have 
the maximum possible period, 2” — 1, so an LXM algorithm that uses one of these generators as its 
XBG subgenerator will have period 2*(2” — 1). 


6.2 Scalability of Period 


The parameters k (size of LCG state) and n (size of XBG state) may be varied independently. 
When k is made very large, the cost of the multiplication operation grows quadratically (there are 
subquadratic multiplication algorithms, but they are not cost-effective for values of k within the 
range of currently practical interest), so if a larger period is desired, it may be preferable to increase 
n rather than k. Fortunately the xoroshiro family of XBG generators easily grows to support state 
sizes 2w, 4w, 8w, 16w, and beyond without a significant increase in computational cost per value 
generated (though for the specific sizes 4w and 8w, the xoshiro algorithm may be preferable). For 
w = 64 (the sweet spot for many of today’s microprocessors), practical choices for k are 64 or 128 
and for n include 128, 256, 512, and 1024, supporting periods ranging from 2!” — 2°4 to 2115? — 2128. 
For w = 32 (a sweet spot for smaller processors used in embedded applications), k = 32 and w = 64 
may be a good choice (period 2° — 2°), 


6.3. Probability of Overlapping Sequences 


Given a PRNG algorithm with a single state cycle of period P, suppose that we choose two distinct 
positions on the cycle literally uniformly at random, and then for each one consider the sequence of 
length ¢ consisting of the state at that position and the f—1 states following it. What is the probability 
that the two sequences will overlap? We care about this because long overlapping subsequences will 
produce highly correlated (indeed, identical) outputs that would not be characteristic of sequences 
of values chosen truly at random. 

By symmetry, without loss of generality we may assign the first chosen position q, the index f, 
and then choose the second position q2 uniformly at random from the range of integers [0..P — 1]. 
Overlap occurs if and only if 1 < qz < 2 — 1. The number of choices that allow overlap is 2 — 1, 
so the probability of overlap is (2 — 1)/P. 

Now suppose instead of one big state cycle of period P, we have A distinct state cycles of period 
P/A, and we do the following process twice: first choose a state cycle uniformly at random, then 
choose a position on that state cycle uniformly at random, then consider a state sequence of length 
é starting at that position. The two sequences can overlap only if they lie on the same state cycle 
(probability 1/A); if they do, the probability of overlap is (2 — 1)/(P/A) as before, so the overall 
probability is (2@ — 1)/A(P/A) = (2€ — 1)/P. Thus this intuition: breaking the big state cycle up 
into equal-sized pieces does not affect the probability of overlap. 

In LXM, the effect of having an additive parameter in the LCG is to select one of a number 
(typically 2”! or 2*-1) of state cycles (though, as we discuss below in Section 6.5.1, these state 
cycles are not terribly different), each of period 2*(2” — 1). The point we wish to make here is 
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that bits in the additive parameter are just as effective as bits in the LCG state or the XBG state in 
reducing the probability of overlap, except for the fact that the lowest bit of an additive parameter 
is “wasted” because it must be 1. As an example, let’s compare an LXM algorithm L, with k = 64 and 
n = 128 with a modified LXM algorithm Lz with k = 128 and n = 128 but the additive parameter is 1 
in every instance. Each instance of L; has 64 bits of LCG state, a 64-bit additive parameter, and 128 
bits of XBG state. Each instance of Lz has 128 bits of LCG state and 128 bits of XBG state, and it needs 
no per-instance storage for the constant additive parameter. So the per-instance storage for each of 
L, and Ly is 256 bits. For L2, the probability of overlap is (2¢— 1) /(21°8(2!8 —1)) = (2@-1)/2?°; for 
L,, the probability of overlap is (2 — 1) /(2°2°4 (2178 — 1)) ~ (2@—-1)/25°, which is the same except 
for that one wasted bit. If we let £ = 2°° and create 2°” instances of Ly, initializing their states and 
additive parameters truly at random, then the chances that two of them will have the same additive 
parameter are fairly high, thanks to the Birthday Paradox (choosing 2°° values with replacement 
from a set of 2° items), but the probability of any pair of instances overlapping is roughly 271”, 
and the probability that some pair out of the 2%” instances will overlap is roughly 2~™° (because 
2°? is quite small compared to 2!””, the effect of the Birthday Paradox can be neglected). 

It follows that, under the crucial assumption that initializing the state of newly created instances 
using the output of a PRNG is sufficiently close to truly random for this purpose, we can be confident 
that instances produced by the split() operation described in Section 3.1 are highly likely to avoid 
unwanted correlation due to accidental sequence overlap, and we can increase our confidence either 
by increasing the size of the XBG state, increasing the size of the LCG state, and/or increasing the 
number of bits in the additive parameter (remembering that this last size cannot usefully exceed 
the size of the LCG state). 


6.4 Equidistribution 


A k-bit LCG of period 2* produces each possible k-bit value exactly once during each cycle, so it is 
exactly equidistributed. The high-order w bits of the output are likewise exactly equidistributed; 
each of the 2” distinct values is produced 2*-™ times during the cycle. 

An n-bit XBG of period 2” — 1 produces each w-bit value 2””-™ times, except that there is one 
value, typically 0, that is produced only 2”-” — 1 times. Such a generator is 2~("-™) -distributed. For 
example, for w = 64, the xoroshiro128 algorithm (n = 128) is 2~°+-distributed, and the xoshiro256 
algorithm (n = 256) is 2~'°*-distributed. 

An LXM algorithm that combines two such subgenerators is exactly equidistributed, because 
each position in the period of the LCG “meets” (and is therefore combined with) each position in 
the period of the XBG exactly once during the period of the LXM generator, so for every position 
in the XBG cycle, the w-bit value in that position has added to it every possible w-bit value exactly 
2k-w times. (Applying a bijective mixing function leaves equidistribution qualities unaffected.) 

If the n-bit XBG of period 2” — 1 is (n/w)-dimensionally equidistributed—that is, using groups of 
n/w successive outputs to form (n/w)-tuples results in generating every possible tuple except one 
(call it Z, because it is typically the all-0 tuple), exactly once—then an LXM generator for which 
k = wis also (n/w)-dimensionally equidistributed; precisely put, every possible (n/w)-tuple of 
values is generated 2* times, except that if D is any (n/w)-tuple that can be generated by the LCG 
itself, then D + Z is generated by the LXM generator only 2* — 1 times. (This conclusion relies on 
the fact that k = w guarantees that no two of the 2* (n/w)-tuples generated by the LCG are equal, 
whereas this is generally not true when k > w.) 

For example, xoroshiro1 28 is 2-dimensionally equidistributed [Blackman and Vigna 2018]; using 
the terminology we define in Section 4, we can observe that xoroshiro128 is 2-dimensionally 
1-distributed, and it follows that LXM using a 64-bit LCG and xoroshiro1 28 is 2-dimensionally 
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2-°4-distributed; so both xoroshiro128 and the LXM based on it can be said to be 2-dimensionally 
equidistributed, but the LXM has a much better 6 value, reflecting the fact that it really can generate 
all possible 2-tuples, though a few of them are generated very slightly less often than the others, 
whereas for xoroshirol1 28 by itself there is one 2-tuple that is never generated. 

Similarly, we can observe that because xoshiro256 is 4-dimensionally equidistributed (more pre- 
cisely, 4-dimensionally 1-distributed), an LXM using a 64-bit LCG and xoshiro256 is 4-dimensionally 
2~°4-distributed. Likewise, LXM using a 64-bit LCG and xoshiro512 is 8-dimensionally 2~*- 
distributed, and LXM using a 64-bit LCG and xoroshiro1024 is 16-dimensionally 2~°*-distributed. 

To summarize, the LXM algorithm can improve the equidistribution properties of its XBG 
component in two ways: (1) by making the sequence of w-bit outputs exactly equidistributed rather 
than approximately; and (2) when k = w and the XBG is j-dimensionally 6-distributed for some 
j > 1, by reducing 6 by a factor of 2”. 

(We also note that for an application that makes heavy use of, say, 2-tuples of 64-bit values, one 
could use a modified version of LXM for which w = 128 and k = 128 for the LCG, but w = 64 
and n > 128 for the XBG, where for every generated 2-tuple of 64-bit values the LCG is advanced 
once and the XBG is advanced twice. The overall generator would then be exactly 2-dimensionally 
equidistributed. However, we have not yet studied or tested such a generator in any depth.) 


6.5 Why We Need a Nontrivial Mixing Function 


6.5.1 The Shape of LCG Graphs. Durst [1989] observed that, in some sense, every LCG on w-bit 
words whose period is 2” that uses the same multiplier m produces “the same sequence’; if we 
imagine a two-dimensional plot of points (i, y;), then changing the additive constant a has the 
effect of shifting the graph horizontally and vertically and possibly also flipping it top-to-bottom, 
but the overall “shape” of the graph is unchanged. (Coveyou had earlier remarked that the choice 
of addend for an LCG is “not of great importance” [Coveyou 1969, §12.1].) 

To see this, choose any specific m, a, and a’ such that (m mod 8) = 5, and a and a’ are odd, and 
consider two LCGs L = (so, Ao.(mo + a) mod 2”, Ao.) and L’ = (s}, Ao.(mo + a’) mod 2”, Ac.c). 

There are then two cases. 

(i) If (a — a’) mod 4 = 0, let r be a solution to the congruence a’ = a— (m-1)r (mod 2%); it is 

1, — a-a’ 


unique because m — 1 and a — a’ are multiples of 4, so we can rewrite it as a? = 5 (mod 2”); 


then, because m — 1 is an odd multiple of 4, mt has a multiplicative inverse modulo 2”, therefore 
r= (a) we mod 2™. Let i be the smallest nonnegative integer such that s/ = (so +r) mod 2”. 
Now an inductive argument: assume that s/ ae (sj; +r) mod 2™; then 


Siaje (ms},; +a’) mod 2” 
(m(s; +r) +a—(m-—1)r) mod 2” 

= (msj+mr+a-—mr+r) mod 2” 

= (ms;+a+r) mod 2” 

= (Sj41 +r) mod 2™ 


and we can conclude that Shi = (sj +r) mod 2” is true for all j > 0. In words, the graph of L’ is 
the result of shifting the graph of L rightward by i and upward by r, where the upward shift is 
actually a rotation modulo 2”. 


(ii) If (a — a’) mod 4 = 2, let r be a solution to the congruence a’ = (—a) + (m—1)r (mod 2™); 
it is unique because both m — 1 and a+a’ are multiples of 4, so we can rewrite it as mtr == 


(mod 2”); therefore r = (mat) ata mod 2”. Let i be the smallest nonnegative integer such that 


s; = —(So +r) mod 2”. Now an inductive argument: assume that Si = —(s; +r) mod 2”; then 
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SiejH a (ms;,,; +a’) mod 2” 
= (m(-(s; +1r)) + (—a) + (m- 1)r) mod 2” 
= (-ms; —mr-a+mr-—r) mod 2” 
(—ms; — a—r) mod 2” 
= —(Sj41 +r) mod 2” 
and we can conclude that Sij = —(s; +r) mod 2™ is true for all j > 0. In words, the graph of L’ is 


the result of shifting the graph of L rightward by i and downward by r (rotating modulo 2”), then 
flipping the graph vertically by negation of the y-axis (again modulo 2”). 

Because the output function selects the high-order bits of the LCG state, the effect is to shrink 
the graph vertically (dividing by 2”) and then to apply a floor function; thus if k > w, the shape 
still remains roughly the same, though there is some jitter. Choosing different additive parameters 
for an LCG is not, of itself, a good way to produce streams that will appear to be independent. 


6.5.2 The Shape of XBG Graphs. A similar (and simpler) argument shows that every full-period 
XBG that uses the same matrix U produces “the same sequence’; to see this, choose an n-by-n bit 
matrix U whose characteristic polynomial is primitive (therefore U is invertible), and also choose 
two n-bit vectors v and v’; then consider the two XBGs X = (x9, At.(Ut @ v), At.t[0..w — 1]) and 
X! = (xj,At.(Ut ® 0’), At.t[0.. w — 1]). By the Cayley-Hamilton theorem and primitivity of the 
characteristic polynomial, any polynomial in U of degree n— 1 or less can be expressed as a positive 
power of U [Engelberg 2015, §9.7]; it follows that because U is invertible, U © IJ is invertible. 
Now consider the equation v’ = v © (U @ I)r; because (U © J) is invertible, we can easily solve 
the equation to get the unique solution r = (U ® I)"'(u @ v’). Let i be the smallest nonnegative 


integer such that x; = x) ® r. Now an inductive argument: assume that x;,; = x; ® r; then 
, _ tana 
Xigjtt = UXj4; ®0 


U(xj @r)eve(U Slr 
Ux; @Ur@v@Urer 
= Ux;®uv@r 
= Xj+1 Or 


and we can conclude that x;, ; 
of shifting the graph of X rightward by i and “xor-flipping” the vertical axis by r. 

Thus an XBG with state update function Ar.(Ur © v) and output function At.r[0..w — 1] 
is effectively equivalent to an XBG with state update function Ar.(Ur) and output function 
(Ar.(r[0..w — 1] 4), where 6 = (((U @1)“)uv)[0..w-1] = (((U @I)?)[0..w-1;0..n—1])o. 
The graphs of all XBGs that use matrix U have “the same shape” but “xor-shifted” by a constant. 

An xor with a constant affects the bits of the XBG state independently, and the output function 
selects high-order bits of the XBG state without regard to the value of any state bit; therefore 
graphs of the output values will also have “the same shape.’ Choosing different additive parameters 
for an XBG is not, of itself, a good way to produce streams that will appear to be independent. 


= x; ®r is true for all j > 0. In words, the graph of X’ is the result 


6.5.3. The Purpose of the Additive Parameter. In the LXM algorithm, the real purpose of the additive 
parameter in the LCG is not to select one of many LCG streams in hopes that these many streams 
will appear to be independent, because they cannot. Similarly, an additive parameter in an XBG 
will not select one of many independent streams. What we have seen is that, in effect, one might as 
well use a fixed LCG and a fixed XBG, combine their outputs, then add (or xor) a parameter, then 
apply the mixing function. 
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Then why does the parameter appear in the LCG rather than later in the algorithm? It is purely 
an engineering tweak, a bit of optimization. From a theoretical point of view, we can equally well 
introduce a parameter in any of three places: in the LCG, or in the XBG (by using an F.-affine 
state update function At.Ur © v rather than the purely F,-linear state update function At.Ur), or 
by using a combining function such as A(p, q).p + q + a. (We could even introduce parameters 
in two, or all three, of those places, but there seems to be little extra benefit.) We observe that 
introducing the parameter in the XBG or the combining function requires “extra work”—perhaps 
one additional instruction—on today’s typical hardware architectures, but the LCG needs to add 
some odd value in order to have full period, and it’s easy to make that odd value be a parameter 
rather than a constant. Moreover, in the style of coding where the LCG update and XBG update are 
potentially computed in parallel with the combining and mixing functions, and given that a good 
mixing function takes longer to compute than the LCG update, adding the parameter in the LCG 
rather than in the combining step moves that addition operation off the critical path. 

The hope, then, is that the additive parameter, despite being implemented at part of the LCG, will, 
in effect, select one of many mixing operations. In order to achieve this result, the mixing function 
certainly needs to be nonlinear, and ideally its range will appear to be a random permutation of its 
domain. Beyond this point theory offers us little firm guidance, and so we turn to empirical testing. 


7 TESTING 


We consider the TestU01 BigCrush test suite [L’Ecuyer and Simard 2007; Simard 2009] to be the 
current gold standard for final testing of any pRNG algorithm before deployment. However, we 
found PractRand [Doty-Humphrey 2011-2021] to be an extremely useful additional tool for two 
purposes: experimental exploration (because it fails fast on poor pRNG algorithms) and evaluating 
relative degrees of weakness (because the length to which a tested sequence must grow before 
failure is reported appears to be a more sensitive and repeatable metric than the p-value calculated 
for a sequence of fixed length). An algorithm that passes PractRand at the 4 TB threshold is worthy 
of final testing with BigCrush. 

In testing variations of the LXM algorithm, we have performed over 52,000 complete runs of 
PractRand and over 50,000 complete runs of TestU01 BigCrush. For reasons of space we cannot 
present all the results of these tests, but we do present and describe tables that summarize salient 
results from BigCrush, and we describe and summarize in prose form salient results from PractRand. 


7.1. Test Framework 


We built a small testing framework to control thousands of test runs of multiple pRNG algorithms, 
using both the BigCrush test suite and the PractRand test suite. 

Nearly all the tests were performed on a cluster of 16 nodes, each with two sockets, each with an 
E5-2660 2.2GHz Intel Xeon processor (each having eight cores collectively supporting 16 threads). 
Therefore 512 threads can execute simultaneously. (A very small fraction of the tests were run on a 
Macintosh Pro with two 2.8 GHz quad-core Xeon processors. This was done to validate the testing 
software before reserving time on the big cluster. The results of these initial runs constituted valid 
measurements and were retained.) 

We made no attempt to parallelize the PractRand and BigCrush test suites; instead, we used 
make files to generate thousands of jobs at a time. Each make file describes one batch of test runs. 
Each make file includes code to find out which of the compute nodes it is being run on, so that a 
different subset of the batch of test runs will be run on each node. The use of make files allowed a 
very simple form of crash recovery: simply a matter of re-issuing the make command. 

Each individual run tested the behavior of one pRrNG algorithm, starting it from one specific state 
and testing the statistical quality of its output stream. While BigCrush and PractRand differ in the 
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kinds of statistical tests they employ and the way they report the results of their analysis, they are 
alike in four key ways: 


e There is a simple way to code new pRNG algorithms in C (or C++) and link them into the test 
suite. (This avoids the I/O overhead for piping the pRNG output stream into the test suite.) 

e Results are reported by printing text to “standard output”; each report includes statistical 
information and also an indication of the total amount of CPU time (user execution time) 
consumed by the test. 

e Each has a command-line interface that allows specification of which prNG algorithm to test. 

e The command-line interface does not allow a complete specification of the initial state of 
the prNG, but does allow specification of a 64-bit seed from which the initial state can be 
constructed, and the construction code can be user-specified and bundled with the code for 
the prNG algorithm itself. 


We designed a detailed encoding that would allow us to use the single 64-bit integer parameter in 
the command line to specify a wide variety of initial states. 


7.1.1 Distilling BigCrush Reports. The BigCrush test suite runs 106 individual tests [L’Ecuyer 
and Simard 2013, function bbattery_BigCrush, pp. 148-152], computing 160 test statistics and 
p-values [L’Ecuyer and Simard 2007]. A single test run typically prints about 110 kilobytes of 
information; at the end is either the message “All tests were passed” or a list of anomalies, that 
is, tests whose p-values were outside the range [0.001 .. 0.999]. 

For every algorithm tested with TestU01, we ran the entire suite three times, once in each of 
three distinct modes, identified by the letters f, g, and u. The f mode generates double values by 
generating a 64-bit integer, then right-shifting it by 11 and dividing by 2°° to produce a value in the 
range [0.0..1.0). The g mode generates double values by generating a 64-bit integer, reversing 
the order of its bits so that bit j becomes bit 63 — j, then right-shifting it by 11 and dividing by 2°°. 
The u mode generates double values by generating a 64-bit integer, then dividing each half (first 
the low half, then the high half) by 2°” to produce two double values, one after the other. (Late in 
our testing process we added a fourth mode, w, which generates double values by generating a 
64-bit integer, then reversing the bit order of each half and dividing by 2°.) As it turned out, we 
observed in the measured results no obvious differences between testing modes. 

The distillation software for BigCrush test runs distills the list of anomalies for each test run 
into a pair of integers (J, c) (a warning level and a count) in this manner: If a test run file is missing, 
then (J, c) = (—1, 0). Ifa test run file is present but is incomplete or malformed, then (1, c) = (—2, 0) 
(this can happen if a test run was terminated before completion). If a test run file is present and all 
tests were passed (107° < p < 1— 107°), then (J, c) = (0, 0). Otherwise, the test run file was present 
and well-formed but reported one or more anomalies. Each anomaly is categorized according to its 
reported p-value (or, if p > 0.5, by using 1 — p) into one of seven warning levels: if p is eps then 7, 
else if p is eps1 then 6, else if p < 10-'? then 5, else ifp< 10~° then 4, else ifp< 10~° then 3, else 
if p < 10-4 then 2, else if p < 107 then 1; then / is the highest warning level among all anomalies 
for the test run, and c is the number of anomalies having that highest warning level. We regard a 
run as a complete failure if I is 6 or 7. 


7.1.2 Distilling PractRand Reports. The PractRand test suite is oriented toward testing 64-bit integer 
values and includes tests specifically designed to probe weakness in the low-order bits, so we used 
PractRand directly on the generated 64-bit values and made no attempt to define multiple testing 
modes. We ran each test until it had either reported failure or generated 4 TB of test data. 

A single test run that gets all the way to 4 terabytes typically prints about 5 kilobytes of informa- 
tion. For each anomaly reported, PractRand prints not only a p-value but also a word or phrase 
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Table 1. Some of the “magic constants” used in testing. Multiplier values m are presented in decimal form and 
are among those recommended by L’Ecuyer [1999b, Table 4, p. 258]; all others are presented in hexadecimal 
form and are random values originally obtained from HotBits [Walker 1996], with A values forced to be odd. 


32 bits 
Ag = @x4E1FD53B 
Ajo = @x95@F5BFF 
Aj2 = @xFB999853 
64 bits 


Mz = 2891336453 
m4 = 29943829 
Me = 32310901 


Sg = @x4C3CA493 
Sio = 0X734B1FEF 
Siz = @x36BAEQ16 


M2 = 2862933555777941757 
mg = 3202034522624059733 
Me = 3935559000370003845 


Ag = 0x856FA2A9BC6917B7 
Ajo = 0x873CQF33448D2C35 
Aj = 0xD321702ECD7BDA75 


Sg = @xCFEADA5EE4037657 
Si9 = 0x@D1729016D5CA71D 
Sip = OxAF5AA696D8CO97F 6 


describing that p-value; in increasing order of severity, they are unusual, suspicious, SUSPICIOUS, 
very suspicious, VERY SUSPICIOUS, and FAIL. (PractRand may further print a varying number of 
exclamation points after the word “FAIL” but we chose to ignore those: failure is failure.) We relied 
on these nonnumerical descriptions in distilling the reports into a pair of integers (J, c) (a warning 
level, ranging from 1 for unusual to 6 for FAIL, and a count) in a manner similar to that used for 
BigCrush. In addition, for each warning, the amount of data processed is recorded. 


7.2 Results of BigCrush Tests 


Table 1 lists some constants that are referred to by name in later tables. Not shown for lack of space 
are similar constants Xg, Xj9, and X12; also not shown are similar 16-bit and 128-bit constants. 

Table 2 and other tables after it present summarized BigCrush results; the BTgX source for these 
tables was generated automatically by the distillation software described in Section 7.1. Each line 
of the table summarizes a set of tests that differ only in stream count (the number of instances 
whose outputs are used in round-robin fashion) and mode. The first line of the table’s footer shows 
the total number of test runs and the total CPU-thread time expended; the second line shows the 
set of stream counts and set of modes used for every line in the table. 

For each line in the table, the first three columns show w, k, and n. The next two columns name 
the mixing function and initialization strategy. The next five columns give m, a, so, Xo, and the 
combining function (+ or ©); if a value is underlined, then every instance uses the indicated value; 
otherwise each instance uses a value generated by some other instance in a manner dictated by the 
particular initialization strategy. N is the total number of test runs for that line of the table. The 
next eight columns show the number of test runs whose highest warning level was 0, 1, 2,..., 7; 
recall that warning levels 6 and 7 indicate complete failure (see Section 7.1.1). The last two columns 
give the total number of warnings () and the smallest p-value (Pyyo;s¢) seen during the N runs. 

Figure 4 shows C code for mixing functions we tested: murmur32 and murmur64 [Appleby 2011]; 
degski32 and degski64 [degski 2018]; lea64, by Doug Lea [Lea 2013]; starstar16 [Blackman 
and Vigna 2018] (a scrambler, never intended to be a mixing function); and madeup16, by one of us. 

These are the five initialization strategies that appear in the tables (let x be the stream count, 
and it is implicitly understood that as the non-underlined values for an instance are filled in, the 
underlined values are also filled in as specified in the table): 


same uses the listed m, z, so, and xo values to create a single extra instance of the LXM, outputs 
of which are used to initialize non-underlined values for the x instances to be tested. 

tree b uses m, z, So, and Xo to initialize instance 0, then for all 1 < j < x in ascending order, 
output from instance |j/b] is used to initialize non-underlined values for instance j. 
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uint64_t degski64(uint64_t z) { 
Zz *= (Zz >> 32); 
Z *= Oxd6e8feb86659fFd93ul11; 
Zz *= (z >> 32); 
Z *= Oxd6e8feb86659Fd93ul11; 
return z * (z >> 32); } 


uint64_t lea64(uint64_t z) { 
Zz *= (z >> 32); 
z x= QxdabaQb6eb09322e3ull; 
Zz *= (Zz >> 32); 
z *= QxdabaQb6eb09322e3ull; 
return z * (z >> 32); } 


uint32_t murmur32(uint32_t z) { 
z *= (z >> 16); 
Z *= @x85ebca6bul ; 
z *= (z >> 13); 
Z *= Qxc2b2ae35ul; 
return z * (z >> 16); } 


uint64_t murmur64(uint64_t z) { 
Zz *= (z >> 33); 
z x= Oxff51afd7ed558ccdull; 
Zz *= (z >> 33); 
Z *= O@xc4ceb9fel1a85ec53ull; 
return z * (z >> 33); } 


uint32_t degski32(uint32_t z) { 
z *= (z >> 16); 
z *= Qx45d9f3bul; 


uint16_t starstar16(uint16_t z) { 
Zz=2Zz%* 5; 
return ((z << 7) | (z >> 9)) * 9; 3 


z *= (z >> 16); uint16_t madeup16(uint16_t z) { 
z *= 0x45d9f3bul; Z = (uint16_t)((z * (z >> 8)) * @xca6b); 
return z * (z >> 16); } Z = (uint1l6_t)((z * (z >> 9)) * Qxae35); 


return (uint16_t)(z * (z >> 8)); } 


Fig. 4. Mixing functions used during testing 


skip uses m, Z, so, and Xo to initialize instance 0, then for all 1 < j < x in ascending order, all 
non-underlined values for instance i are copied from those of instance i — 1 and then the 
state of the XBG is advanced one position. 

jump is the same as skip, except that the XBG is advanced by 2”/? positions. 

leap is the same as skip, except that the XBG is advanced by 23”/4 positions. 


7.2.1. Scaling the Number of Streams. Table 2 shows results from LXM instances that use a 64-bit 
LCG, either xoroshiro128 or xoshiro256, and either one of three mixers or none. The combining 
function is + (addition). They are tested for stream counts 1, 2, 4, 8, 16,..., 274 and also three other 
non-power-of-two stream counts, chosen arbitrarily. For each stream count x, five different initial- 
ization procedures are tested: same, tree 2, skip, jump, and leap. We observe BigCrush fails only 
the cases that use no mixing function and use skip, jump, or leap initialization. All three mixing 
functions appear to be equally effective in this set of tests. 

We ran similar tests using a 128-bit LCG (with a 64-bit multiplier and either a 64-bit or 128-bit 
additive parameter) and xoroshiro1 28 for the XBG, using the same set of stream counts and the 
same five initialization procedures. The results were quite similar to those in Table 2. 


7.2.2 Tree-shaped (Potentially Parallel) Initialization Strategies. Table 3 shows BigCrush results 
from LXM instances that use a 64-bit LCG, either xoroshiro128 or xoshiro256, and no mixing 
function. The combining function is + (addition). They are tested for stream counts 28 212 914 917 
271, and 2”4. For each stream count, six different branching factors for the tree are tested: 3, 4, 5, 16, 
32, and 256 (the tests shown in Table 2 cover the case of branching factor 2). None of these tests 
fail. Out of 216 tests, just one has a warning level as high as 3. 


7.2.3 Instances with Very Similar Additive Constants. Table 4 shows BigCrush results from LXM 
instances with k = 32 andn = 64,k = 32 andn = 128,k = 64 andn = 128, ork = 64 and 
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Table 2. Test measurements for scaling the number of streams (see Section 7.2.1) 


148:19 


wk n|mixer init ma Ss xX» @|N| 0 1 2 3 4 5 6 7| XZ | pworse 
64 64 128|}murmur64 same jm, Ag Sg Xg +|84/61 19 3 1 28) 2.@E-7 
64 64 128|murmur64 tree2|m, Ag Sg Xg +|84|54 26 4 37/3.@E-5 
64 64 128|murmur64 skip jm, Ag Sg Xg +]|84/59 19 6 29/3.@E-6 
64 64 128|murmur64 jump |m, Ag Sg Xg +]|84]/58 22 4 34/3.8E-5 
64 64 128}murmur64 leap |m, Ag Sg Xg +|84/65 15 4 24|6.0E-5 
64 64 128|degski64 same |m, Ag Sg Xg +|84/56 19 9 36|3.7E-5 
64 64 128|degski64 tree2|m, Ag Ss Xg +|84/64 12 8 24/1.0E-5 
64 64 128|degski64 skip |m, Ag Sg Xs +|84|/58 22 4 31/1.6E-6 
64 64 128|degski64 jump |m, Ag Ss Xg +|84/61 18 5 30|2.3E-5 
64 64 128/degski64 leap |m, Ag Sg Xg +|/84/57 20 7 32|3.@E-5 
64 64 128] lea64 same |m, Ag Sg Xg +|84/63 19 1 1 24|2.8E-7 
64 64 128]lea64 tree2|m, Ag Sg Xs +|84/60 20 4 30/4. 4E-6 
64 64 128]lea64 skip |m, Ag Sg Xg +/84/62 21 1 26|9.4E-5 
64 64 128]lea64 jump jm, Ag Sg Xg +|84/52 26 6 40|2.8E-5 
64 64 128]lea64 leap |m, Ag Sg Xg +|84|56 18 10 35|2.7E-6 
64 64 128|none same |m, Ag Sg Xg +|84/50 29 5 38/4.6E-5 
64 64 128|]none tree2|m, Ag Ss Xs +|84/57 23 4 33/1.1E-5 
64 64 128]}none skip M, Ag Sg Xg +/84) 2 0 1 0 0 O O 81/6406] eps 

64 64 128)none jump Mm, Ag Sg Xg +|/84/41 7 2 0 0 O 1 33] 103] eps 

64 64 128]none leap mM, Ag Sg Xg +/84/43 5 3 0 0 O O 33] 104] eps 

64 64 256|murmur64 same jm, Ag Sg Xg +|84/63 16 5 23|7.1E-6 
64 64 256|murmur64 tree2;)m, Ag Sg Xg +]84/55 23 6 36/4.4E-6 
64 64 256|murmur64 skip jm, Ag Sg Xg +]|84/57 21 6 32)1.1E-5 
64 64 256|murmur64 jump jm, Ag Sg Xg +|84)66 15 3 21)3.2E-5 
64 64 256|murmur64 leap jm, Ag Sg Xg +]|84/59 20 5 30/1.9E-5 
64 64 256|degski64 same |m, Ag Sg Xg +|/84/60 17 7 30]1.8E-5 
64 64 256|degski64 tree2|m, Ag Ss Xz +|84/60 20 4 26/8.1E-5 
64 64 256|degski64 skip |m, Ag Ss Xg +|84/51 27 6 39/1.6E-5 
64 64 256|degski64 jump |m, Ag Ss Xs +|/84/62 19 2 1 31|/2.4E-7 
64 64 256|degski64 leap |m, Ag Sg Xs +|84/58 22 4 30/1.1E-5 
64 64 256|lea64 same |m, Ag Sg Xz +|84/63 18 3 22|7.4E-6 
64 64 256|lea64 tree2|m, Ag Sg Xs +/84/53 24 7 36|1.9E-6 
64 64 256|1lea64 skip |m, Ag Ss Xs +[84/59 18 7 26|3.7E-6 
64 64 256|lea64 jump jm, Ag Sg Xg +|84/63 18 3 26|3.0E-5 
64 64 256|lea64 leap |m, Ag Sg Xg +|84/62 17 5 27|2.4E-5 
64 64 256|none same |m, Ag Sg Xg +|84|55 27 2 38/4.5E-5 
64 64 256|none tree2|m, Ag Sg Xs +/84/52 30 2 41|3.2E-5 
64 64 256|none skip m, Ag Sg Xj +/84) 2 1 0 0 0 O O 81/6318] eps 

64 64 256|none jump Mm, Ag Sg Xg +|84/32 17 2 0 0 0 O 33 95| eps 

64 64 256|none leap mM, Ag Sg Xg +|84/38 13 0 0 0 0 O 33 86} eps 


3360 complete runs of BigCrush 


Stream counts used: { 2/ | 0 < j < 24} U { 1900547, 5242880, 12582912 } 


Total CPU-thread time: 1433 days + 13:31:27 


Modes used: u f g 
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Table 3. Test measurements for tree-shaped (potentially parallel) initialization strategies (see Section 7.2.2) 


wk nimixer init ma 5% xX @N| 0123 45 6 7d Pworst 

64 64128 none tree3 |m, Ag Sg Xs +/18/10 6 2 8, 1.8E-5 
64 64128none treed (m, Ag Sg Xg +/18/15 3 3] 4.8E-4 
64 64128none tree5 |m, Ag Sg Xg +/18/10 6 2 9} 3.3E-5 
64 64128/none tree 16 m, Ag Sg Xg +/18] 9 9 11] 1.3E-4 
64 64 128 none tree 32 |m, Ag Ss Xg +/18/10 6 12; 1.6E-5 
64 64 128|none tree 256m, Ag Sg Xg +/18/13 4 1 5] 1.@E-4 
64 64 256 none tree3 im, Ag Sg Xg +/18/15 3 3) 1.1E-4 
64 64 256 none treed |m, Ag Sg Xg +/18) 9 6 3 9} 1.8E-5 
64 64 256 none tree5 (m, Ag Sg Xg +/18/11 7 7| 1.5E-4 
64 64 256 none tree 16 |m, Ag Sp Xg +|18/10 6 1 1 9] 2.2E-7 
64 64 256\none tree 32 |m, Ag Ss Xg +/18/14 3 1 7| 4.2E-5 
64 64 256 none tree 256\m, Ag Sg Xg +/18) 9 6 3 12; 4.1E-5 
216 complete runs of BigCrush Total CPU-thread time: 96 days + 15:34:05 
Stream counts used: { 2°, 217, 2!4, 217, 271, 224 } Modes used: u f g 


n = 256. The combining function is + (addition). We tested all 200 combinations of 25 stream 
counts ({ 2/ | 0 < j < 24}), two different multipliers m, and mg for the LCG, 2 mixing functions 
(none, or murmur of the appropriate word size), and two ways to choose the additive constants. The 
initialization strategy was same in all cases, except that the additive constants were chosen to be 
very similar: for stream count x, for 0 < i < x, the additive parameter was either 1 + 32i or Ag + 32i. 
All cases with no mixing function and a stream count below 1024 fail. All cases using a murmur 
mixer passed, and out of 2000 tests, just one has a warning level as high as 3. 

On the other hand, certain contrived tests fail BigCrush spectacularly: if the initial states so and 
xo of two instances are identical (a situation unlikely in practice) and on top of that their additive 
constants a differ only in the high-order bit (even less likely), then the values produced by the 
combining function will differ only in the high-order bit, and it’s asking too much of a fast mixing 
function to produce apparently independent streams from such inputs. 

We conclude that the mixing function may play a valuable defensive role when the additive 
constants of the LCGs are somewhat similar, but in very rare cases may fail to do the job; it’s 
important to try to initialize multiple instances to very different states. 


7.2.4 Instances That Use xor for the Combining Function. Table 5, which may be compared with 
Table 4, shows BigCrush results from LXM instances with either k = 32 and n = 64, or k = 64 and 
n = 256. The combining function is  (xor). As in Section 7.2.3, we tested all 200 combinations of 
25 stream counts ({ 2/ | 0 < j < 24}), two different multipliers m, and me for the LCG, 2 mixing 
functions (none, or murmur of the appropriate word size), and two ways to choose the additive 
constants. The initialization strategy was the same in all cases, except that the additive constants 
were chosen to be very similar: for stream count x, for 0 < i < x, the additive parameter was either 
1+ 32i or Ag + 32i. All cases with no mixing function and a stream count below 1024 fail. All cases 
using a murmur mixer passed, and out of 1000 tests, just two have a warning level as high as 3. (We 
also tested multiplier m2; the results, not shown here for lack of space, were similar.) 

We conclude that when a good mixing function is used, using xor for the combining function 
appears to be no worse than using addition (but note that using xor may be significantly worse 
than using addition if no mixing function is used [Marsaglia 1985]). 
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Table 4. Test measurements for instances with very similar additive constants (see Section 7.2.3) 


w k  n|mixer init | m a So X @| N|O 1 2 3 4 5 6 7| ZX] Pworse 
32 32 64]none same |m, 14321 Sg Xs +/125/55 18 4 0 O O 37 11/105) eps 
32 32 64]none same |m, Agt32i Sg Xg +/125/58 17 2 0 O O 37 11]112] eps 
32 32 64]none same jm, 1+32i Sg Xg +/125/58 17 2 0 O O 37 11] 97} eps 
32 32 64]none same |™m, Agt32i Sg Xg +/125/58 14 5 0 O O 37 11]107| eps 
32 32 64/murmur same |}m, 1+32i Sg Xg +/125|/87 32 6 44|3.5E-5 
32 32 64/murmur same |m, Agt32i Sg Xg +/125|82 36 7 52|1.8E-6 
32 32 64/murmur same |m, 1+32i Sg Xg +/125|93 27 5 35|4.1E-5 
32 32 64/murmur same |m, Agt32i Sg Xg +/125)95 24 5 1 38|5.8E-7 
32 32 128]}none same jm, 1+32i1 Sg Xg +/125/64 15 5 0 0 0 35 6] 86] eps 
32 32 128/none same |m, Agt32i Ss Xs +[125|61 17 6 0 0 0 35 6) 93] eps 
32 32 128]}none same jm, 1+32i Ss Xg +|125/64 11 9 0 0 O 35 6/101} eps 
32 32 128)none same |m, Ast32i Ss Xs +|125|61 19 4 0 0 0 35 6/ 84] eps 
32 32 128}/murmur same jm, 1+32i Sg Xg +]125/90 29 6 41/2.4E-5 
32 32 128}/murmur same jm, Agt32i Sg Xg +]125/77 45 3 54|8.7E-6 
32 32 128/murmur same |m, 1+32i Sg Xg +|125)/82 34 9 58) 1.6E-6 
32 32 128}murmur same jm, Agt32i Sg Xg +]125/85 35 5 47|4.6E-5 
64 64 128]}none same jm, 1+32i Sg Xg +/125/77 28 9 0 0 0 9 2] 59] eps 
64 64 128]}none same |m, Agt32i Sg Xg +/125/79 30 4 1 0 0 9 2] 57] eps 
64 64 128]}none same |m, 1+32i Sg Xg +/125/78 28 7 1 0 0 9 2] 63} eps 
64 64 128]none same |m, Ag+32i Sg Xg +|125/76 38 0 0 0 0 9 2] 59) eps 
64 64 128/murmur same |m, 1+32i Sg Xg +/125/89 31 5 46|1.1E-5 
64 64 128/murmur same |m, Agt32i Sg Xg +/125|91 27 7 36/1.1E-5 
64 64 128/murmur same |m, 14+32i Sg Xg +/125|)87 31 7 45|8.@E-5 
64 64 128/murmur same |m, Agt32i Sg Xg +/125|/86 33 6 47|2.7E-5 
64 64 256]none same jm, 1+32i Ss Xg +1125/89 19 5 2 0 0 9 1| 43} eps 
64 64 256/none same |m, Ag+32i Sg Xg +/125/83 29 3 0 0 0 9 1| 53] eps 
64 64 256/none same |m, 1+32i Ss Xg +|125/80 28 7 0 0 0 9 1] 56} eps 
64 64 256/none same |m, Agt32i Ss Xg +[125/83 26 5 1 0 0 9 1) 52] eps 
64 64 256/murmur same jm, 1+32i Sg Xg +]125/88 35 2 42/8. 2E-6 
64 64 256/murmur same jm, Agt32i Sg Xg +]125/87 30 8 44|3.4E-5 
64 64 256/murmur same |m, 1+32i Sg Xg +]125/84 33 8 48|6.7E-6 
64 64 256/murmur same jm, Agt32i Sg Xg +]125/93 26 6 40|6.3E-6 
4000 complete runs of BigCrush Total CPU-thread time: 1845 days + 12:33:10 
Stream counts used: { 2/ | 0 < j < 24} Modes used: u v w f g 


7.2.5 Scaling the State Size. One way to see how a family of pRNGs behaves is to consider the 
behavior of very small members of the family. We tested three small variants: w = 32, k = 32, 
n = 128; w = 32, k = 32, n = 64; and w = 16, k = 16, n = 32. In each case the combining function 
was addition. 

Small prncs: Table 6 shows BigCrush results for w = 32, k = 32, and n either 64 or 128. The 
64-bit XBG algorithm is xoroshiro64 [Blackman and Vigna 2018], that is, 


{ ql “= q0; q@ = (q@ << 26) | (q@ >> 6); qa = q@ * ql * (ql << 9); 
ql = (ql << 13) | (ql >> 19); } 
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Table 5. Test measurements for instances that use xor for the combining function (see Section 7.2.4) 


w k  n|mixer init | m a So X @| N|O 1 2 3 4 5 6 7| XZ] Pworse 
32 32 64/none same jm, 1+32i Sg Xs @|125/47 16 2 0 0 0 35 25/149) eps 
32 32 64/none same |m, Agt32i Sg Xg @/125]49 13 1 2 0 0 35 25/151] eps 
32 32 64/none same/m, 14+32i Sg Xg @/125/48 13 4 0 0 0 35 25/142) eps 
32 32 64/none same |m, Agt32i Sg Xg @/125]48 12 5 0 0 0 35 25/171) eps 
32 32 64/murmur same |}m, 14+32i Sg Xg @/125)94 29 2 38|4.6E-6 
32 32 64/murmur same |m, Agt32i Sg Xg @/125|97 25 3 33)3.7E-5 
32 32 64/murmur same |m, 14+32i Sg Xg @/125/88 32 5 42|2.1E-5 
32 32 64/murmur same |m, Agt32i Sg Xg @/125|/86 34 5 46|2.2E-5 
64 64 128/none same|m, 14+32i Sg Xg @/125|76 24 4 0 0 0 7 14| 64] eps 
64 64 128)none same |m, Agt32i Ss Xs ©|125|/78 19 7 0 0 0 7 14) 64] eps 
64 64 128/none same|m, 1+32i Sg Xg @|125|/70 31 3 0 0 O 7 14] 81] eps 
64 64 128)none same |m, Ast32i Ss Xs ©|125|68 33 3 0 0 O 7 14] 83] eps 
64 64 128/murmur same |m, 1+32i Sg Xg @|125/85 31 8 1 49|7.6E-7 
64 64 128/murmur same |m, Agt32i Sg Xs @|125/82 35 7 1 57|2.QE-7 
64 64 128/murmur same jm, 1+32i Sg Xg @|125)/81 35 9 51)2.3E-5 
64 64 128/murmur same jm, Agt32i Sg Xg ©]125/91 28 6 40)1.9E-5 
2000 complete runs of BigCrush Total CPU-thread time: 832 days + 23:59:39 
Stream counts used: { 2/ | 0 < j < 24} Modes used: u v w f g 


with output qQ. The 128-bit XBG algorithm is xoshiro128 [Blackman and Vigna 2018], that is, 
{ uint32_t t = ql << 9; q2 *= q0; q3 *= q1; ql *= q2; qQ *= q3; 
q2 *= t; q3 = (q3 << 11) | (q3 >> 21); } 

with output q1. We tested all 240 combinations of 25 stream counts ({ 2/ | 0 < j < 24}), 4 mixing 
functions (none, murmur32, degski32, and 1ea32), and 2 initialization strategies (same and tree 
2). For n = 64, the version with no mixer always failed when the number of streams was less than 
64; for n = 128, the version with no mixer always failed when the number of streams was less than 
16. In all other cases, no warning level worse than 2 was observed, except for one case with n = 64 
and stream count 256, which had warning level 3. 

Very small prnGs: Table 7 shows BigCrush results for w = 16, k = 16, n = 32; the 32-bit XBG 
algorithm is 

{ q *= (q << 13); q *= (q >> 17); g *= (a << 5); 

which uses one of the triples of shift constants recommended by Marsaglia [2003, §3]. We tested all 
240 combinations of 40 stream counts ({ 2/ | 0 < j < 24} U {256+ 16j|1< j < 15}), 3 mixing 
functions (none, starstar16, and madeup16), and 2 initialization strategies (same and tree 2). The 
version with no mixer always failed when the number of streams was less than 336; no warning 
level worse than 2 was observed for stream counts above 367. The starstar16 mixer produced 
no warning level worse than 2. The madeup16 mixer (so called because its constants were chosen 
at whim, with no attempt to optimize avalanche statistics) also produced no warning level worse 
than 2. So even at this very small scale we see that, on the one hand, even a simple mixing function 
clearly improves the quality, and on the other hand, even a simple mixing function suffices to get 
adequate quality. Focusing on the single-stream case, we find it remarkable that a pRNG with just 
48 bits of state is able to pass BigCrush, and that (with the madeup16 mixer) PractRand tests 1 TB 
of its output (2°° generated values) before failing it (see Section 7.3). 
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Table 6. Test measurements for scaling the state size: 32-bit generators (see Section 7.2.5) 


w k_ n\mixer init ma 5x @®|N}0 123 45 6 7d Pworst 

32 32 64/none same jm, m, Ag Sg Xg +|75|3419 3 1 0 015 3/49 eps 

32 32 64/none tree 2/m, Ag Sg Xg +|75|/3718 2 0 0 015 3/52 eps 

32 32 64)murmur32 same |m, Ag Sg Xg +/75/41 29 5 37 8.7E-6 

32 32 64)murmur32 tree 2)m, Ag Sg Xg +/75]53 17 5 25 7.5E-6 
3232 64|degski32 same |m, Ag Sg Xg +|75)/49 24 2 27 2.3E-5 
3232 64/degski32 tree 2m, As Sg Xs +|75/48 25 2 30|  4.@E-5 
3232 64/lea32 same jm, Ag Sg Xg +/75]48 22 5 33 3.9E-5 
3232 64|lea32 tree 2|m, Ag Sg Xg +|75|57 15 3 20] 5.3E-5 

32 32 128|/none same jm, m, Ag Sg Xg +|75|/42 19 2 0 0 012 36 eps! 

32 32 128|none tree 2\m, Ag Sg Xg +|75/4417 2 0 0 012 [39] — epst 

32 32 128)jmurmur32 same |m, Ag Sg Xg +|75|5219 4 31 2.5E-5 

32 32 128)murmur32 tree 2)m, Ag Sg Xg +/75/52 21 2 29 5.1E-6 

32 32 128|degski32 same m, Ag Sg Xg +|75|60 11 4 17/ 1.1E-5 

32 32 128|\degski32 tree 2|m, Ag Sg Xs +|75/55 15 5 25] 1.9E-5 

32 32 128|lea32_ same _|m, Ag Sg Xg +|75|61 11 3 16] 5.3E-5 

32 32 128/lea32__—i tree 2/m, Ag Sg Xg +|75/56 17 2 24 3.5E-5 
1200 complete runs of BigCrush Total CPU-thread time: 599 days + 19:18:47 
Stream counts used: { 2/ | 0 < j < 24} Modes used: u f g 


Table 7. Test measurements for scaling the state size: 16-bit generators (see Section 7.2.5) 


w k_ njmixer init ma 5% x ®| N}0 123 45 672 Pworst 
16 16 32|none same |m, Ag Sg Xs +/120|38 2810 2 0 022 20|203] eps 
16 16 32|none tree 2|m, Ag Sg Xg +/120/52 15 8 2 1 022 20/193] eps 
16 16 32\madeup16 same |m, Ag Sg Xg +/120/85 30 5 48} 5.6E-6 
16 16 32\madeup16 tree 2|m, Ag Sg Xg +/120/77 38 5 47| 1.6E-5 
16 16 32|starstar16 same |m, Ag Sg Xg +)120/8135 4 47| 2.3E-5 
16 16 32|starstar16 tree 2|m, Ag Sg Xg +/120/86 30 4 40} 2.5E-5 


720 complete runs of BigCrush Total CPU-thread time: 489 days + 6:50:16 
Stream counts used: { 2/ | 0 < j < 24} U{28424k |1<k<15} Modes used: uf g 


7.2.6 LCG Multipliers. Most of our testing has used multipliers recommended by L’Ecuyer [1999b, 
Table 4, p. 258]; we have also run tests using multipliers recently discovered by Steele and Vigna 
[2021, Table 5, p. 17]. We have not detected any significant difference in test results; if LXM quality 
strongly depends on LCG multiplier quality, more sensitive, more specialized tests may be required. 


7.3. Results of PractRand Tests 


Many PractRand tests suggested LXM would generally produce sequences of good quality; for 
example, one set of 216 runs tested 36 different LXM generators with a tree 2 initialization strategy 
for stream counts 1, 2, 32, 2!, 2!5, and 27°; they detected no problems except that (a) for stream count 
2'1, 32-bit LXM generators with no mixing function failed after 128 GB; (b) 16-bit LXM generators 
using no mixing function generally failed; and (c) 16-bit LXM generators using madeup16 failed 
for stream counts 1 (at 1 TB) and 2 (at 4TB). Other tests picked up the (unsurprising) fact that 
starstar (never intended for this purpose) is a weaker mixer than MurmurHash3 and its variants. 
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Table 8. Comparative timings (all measurements in nanoseconds per word generated) 


size Haswell ARM 
(in bits) gcc clang gcc clang 
m a_ out] inline noinline | inline noinline| inline noinline | inline noinline 
L32X64 32 32 32 1.615 2.311 1.602 2.290 2.566 4.076 2.564 4.320 


L32XX64 32 32 32) 1.679 2.321 1.713 2.321 | 2.566 4.076 | 2.641 4.321 
L32X128 32 32 32) 1.547 2.503 | 1.549 2.464 | 2.885 4.444 | 2.690 4.859 
SPLITMIx — 64 64] 1.201 1.688 | 0.967 1.681 2.401 3.512 | 2.401 3.381 
L64X128 64 64 64] 1.614 2.488 | 1.679 2.232 | 3.601 4.603 | 3.602 4.393 
L64XX128 64 64 64) 1.528 2.650 | 1.713 2.239 | 3.601 4.609 | 3.602 4.399 
L64X256 64 64 64] 1.680 2.734 | 1.560 2.431 | 3.601 4.901 | 3.601 5.146 
L128AX128| 64 64 64] 1.859 2.915 | 1.837 3.122 | 7.602 8.006 | 6.402 7.350 
L128BX128] 64 128 64] 1.859 2.714 | 1.831 2.717 | 7.602 9.231 | 6.402 7.396 
L128CX128/128 64 64] 2.566 2.889 | 1.923 2.881 7.602 10.306 | 7.602 9.039 
L128DX128/128 128 64 | 2.563 3.232 | 1.933 2.856 | 7.602 10.481 | 7.602 9.133 
L128EX128} 65 64 64] 2.466 3.005 | 1.919 2.881 7.602 8.027 | 7.602 7.363 
L128FX128] 65 128 64] 2.465 2.780 | 1.929 2.772 | 7.602 8.430 | 7.602 7.378 
L128AX256| 64 64 64] 1.813 3.212 | 1.720 2.881 | 7.602 7.967 | 6.402 7.477 
L128BX256| 64 128 64] 1.812 2.883 | 1.773 2.881 | 7.602 8.027 | 6.402 7.373 
L128CX256/128 64 64] 2.571 3.129 | 1.919 2.933 | 7.602 9.165 | 7.602 9.193 
L128DX256/128 128 64 | 2.568 3.627 | 1.931 3.122 | 7.602 9.333 | 7.602 9.200 
L128EX256} 65 64 64} 2.542 3.120 | 1.920 2.881 | 7.602 7.377 | 7.602 7.401 
L128FX256| 65 128 64 | 2.537 3.231 1.930 2.929 | 7.602 7.683 | 7.602 7.396 


8 COMPARATIVE TIMING TESTS 


Table 8 shows timings for SpLIrMix and a selection of LXM generators representative of various 
configurations of generator state. We tested two architectures: an Intel® Core™ i7-8700B CPU 
@3.20 GHz (Haswell) and an AWS Graviton 2 processor based on 64-bit Arm Neoverse cores 
@2.5 GHz. We performed our tests using two different compilers, gcc 10 (11 on Haswell) and clang 
10. In each case, we tested the next-state function in two ways: forcing inlining, or blocking inlining; 
in the second case, the compiler has to reload the constants involved at each call, and we also pay 
for the function call itself. The two timings gives a differential view of the cost of pure computation 
(without constant loading) versus global cost. We report the average of ten runs; the measurements 
are very stable, with relative standard error below 1%, and in almost all cases below 0.5%. 

The results depend on both architecture and compilers; in general, clang faster code, especially 
with larger state and constants; the case of SPLITMrx is very evident, but note that the timing 
with gcc 10 on Haswell is similar to that of clang, so that specific gap appears to be a scheduling 
defect introduced in version 11. The main visible difference in timing is that between 64-bit and 
larger multipliers. We see no relevant difference between 65-bit and 128-bit multipliers, except for 
the no-inline case of the ARM architecture, where the additional time required to load the larger 
constant is very visible. For the same reason, the difference in timing between the inline and the 
no-inline benchmarks on ARM is sometimes very large. Always on ARM, clang generates better 
inline code for 64-bit multipliers with respect to larger sizes, contrarily to gcc. 

The size of the additive constant a (64 or 128 bits) appears to have little impact; in some cases, 
paradoxically a larger constant generates a shorter timing. Quirks of this kind are unavoidable 
because the models used by compilers to optimize instruction choice and pipelining are not perfect. 
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9 MORE ABOUT JUMPING AND SPLITTING 


The standard way to jump an XBG by j positions is use some precomputed representation of U/, 
then apply that matrix to the XBG state. One common convention is that “jump” advances by 
2"/2 positions and “leap” (“long jump”) advances by 2°”/* positions. This is advantageous for LXM: 
jumping by a power of 2 at least as large the period of the LCG leaves the LCG state unchanged. 
But the representation of U/ is typically not as efficient to apply as U. 

Imagine instead that we wish to make an LXM jump backwards by 2” — 1 positions; that would 
leave the XBG state unchanged, and put the LCG in the same state as if we had advanced the LCG 
just one position. So advancing just the LCG is a simple way to get a cheaper LXM jump function. 
And leaping backward by, say, 2*/?(2” — 1) positions is equally easy, because one can precompute 
constants m’ and a’ such that Ao.(m’o +a’) mod 2* will advance the LCG by 2/2 positions. 

But the point of jump functions is usually to create multiple generators in such a way that their 
generated sequences will not overlap. We believe (but admit that we have not yet proved) that the 
additive parameter provides a very simple way to do that if the mixing function is good: just ensure 
that each instance has a different additive parameter. Choosing the additive value at random, as the 
split() method does, may do that with high probability if 2* is sufficiently larger than the number 
of instances. On the other hand, it is very easy for the splits() method to ensure that all the 
generators in a single generated stream have different additive parameters; this is even easier than 
the cheap strategy for jumping. Testing seems to confirm that this strategy is effective, and splitting 
is easier to use than jumping in applications structured to use recursive fork-join parallelism. 


10 CHOOSING AN LXM ALGORITHM 


If we were presenting a single algorithm, our message would be simple: “use our new algorithm” 
But because we have described a large family of algorithms and tested several, we offer the following 
suggestions for how to choose an algorithm appropriate for a specific application. 

If an application requires a random number generator algorithm that is cryptographically secure, 
then no member of the LXM family is appropriate; programs coded in the Java language should 
continue to use an instance of the class java. security.SecureRandom. 

For applications with no special requirements, L64X128MixRandom has a good balance among 
speed, space, and period, and is suitable for both single-threaded and multi-threaded applications 
when used properly (a separate instance for each thread). But for a single-threaded application, 
Xoroshirol28PlusPlus is even smaller and faster, and certainly has a sufficiently long period. 

For an application running in a 32-bit hardware environment and using only one thread or a 
small number of threads, L32X64MixRandom may be a good choice for reasons of speed and space. 

For an application that uses many threads that are allocated in one batch at the start of the 
computation, one may prefer either a jumpable generator such as Xoroshiro128PlusPlus or 
Xoshiro256PlusP lus, or a splittable generator such as L64X128MixRandom or L64X256MixRandom. 

For an application that creates many threads dynamically, perhaps through the use of Java split- 
erators, we recommend a splittable generator such as L64X128MixRandom or L64X256MixRandom. 
If the number of generators created dynamically may be very large (millions or more), then 
L128X128MixRandom or L128X256MixRandom, which use a 128-bit (rather than 64-bit) parameter 
for the LCG subgenerator, will make it much less likely that two instances use the same state cycle. 

For an application that uses n-tuples of consecutively generated values, it may be desirable to 
use a generator that is k-equidistributed such that k > n. The generator L64X256MixRandom is 
provably 4-equidistributed, and L64X1024MixRandom is provably 16-equidistributed. 

For applications that generate large permutations, it may be best to use a generator whose period 
is much larger than the total number of possible permutations; otherwise it will be impossible 
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to generate some of the intended permutations. For example, if the goal is to shuffle a deck 
of 52 cards, the number of possible permutations is 52! (52 factorial), which is larger than 2?”° 
(but smaller than 27°), so it may be best to use a generator whose period at least 27°°, such as 
L64X256MixRandom or L64X1024MixRandom or L128X256MixRandom or L128X1024MixRandom. (It 
is of course also necessary to provide sufficiently many seed bits when the generator is initialized, 
or else it will still be impossible to generate some of the intended permutations.) 


11 RELATED WORK 


Schaathun [2015] has recently surveyed a number of techniques for splittable pseudorandom gener- 
ators. He traces the origin of the ideas to the 1980s, and in particular to Warnock’s work [Warnock 
1983] in particle physics, where splitting occurs when a particle being simulated spawns new 
particles. A few years later several studies proposed to use different additive constants of LCGs 
to perform splitting, generating a Lehmer tree, until Durst [1989] proved that such sequences are 
strictly correlated, as we discuss in Section 6.5.1. Notably, Schaathun concludes that the crypto- 
graphic approach of Claessen and Palka [2013], which uses cryptographic hashing on the splitting 
tree, is the safest and the only one providing some theoretical guarantees. Later, Steele, Lea, and 
Flood introduced SpiirMrx [2014, §7]; while they do not perform comparative measurements with 
Claessen and Patka’s approach, they conjecture that the latter should yield sequences with better 
statistical qualities than SprirMrx, while SptirM1x should be faster. 

Also the combination of generators of different nature has a long history. A relatively recent video 
on YouTube [Losego 2016] has reverse-engineered the code used for random number generation by 
the well-known video game Super Mario World [Nintendo 1990], which was released on November 
21, 1990. The code merits study as an example of excellent engineering within a severely resource- 
constrained computing environment (a Ricoh 5A22 CPU, closely related to the WDC 65C816), and 
it happens to be very closely related to the LXM algorithm. The generator produces two 8-bit bytes 
each time it is called; each byte is the result of one call to a subroutine. The subroutine implements 
two subgenerators, each with one 8-bit byte of state, and the output of the subroutine is the bitwise 
xor of the outputs s and t of the two subgenerators. One subgenerator is an LCG whose period 
is 256, and the other an XBG using an F,-affine state update function whose period is 217, so the 
overall period of the subgenerator (viewed as a generator of bytes) is 55552. (As far as we can tell, 
the principal advantage of using an F,-affine state update function rather than a purely F2-linear 
function—either would have been equally easy to implement—is that the state of the PRNG can be 
reset by zeroing both state bytes.) The overall period of the main generator (viewed as a generator 
of pairs of bytes) is therefore 27776. The update computation for the two subgenerators is 


s—5Xs+lt<— (t«1)0((t@(t<3)) >>7) 01 


The spectral quality of the multiplier 5 is suboptimal, but on a microprocessor with no multiply 
instruction, 5 is the fastest nontrivial multiplier that provides full period (the entire LCG update 
is five instructions). The period 217 for the xor-based subgenerator is not the best possible, but 
updating a subgenerator of period 255 would take more instructions; 217 is the longest period 
possible among xor-based subgenerators that use relatively few instructions (the entire update is 
eight instructions) and have odd period. Computing the bitwise xor of the subgenerator outputs 
rather than the sum saves one instruction on a microprocessor that has no add instruction, only 
add-with-carry. The result is a pRNG that is small, fast, and adequate in quality for the application. 
Generators in Marsaglia and Zaman’s KISS family [Marsaglia and Zaman 1993; Rose 2018] 
combine three or four generators of different nature to improve the randomness of the output. 
L’Ecuyer and Granger-Piché [2003] study combined generators with components from different 
families, focusing on combining one linear subgenerator with another subgenerator that may or 
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may not be linear. They prove that, under appropriate conditions, combining an LFSR (which is one 
kind of XBG) with another generator will preserve equidistribution properties of the LFSR. They 
also test a number of combined generators and conclude that “combining two different types of 
linear generators, such as a LCG or MRG with a LFSR, seems to do as well as the linear-nonlinear 
combinations, at least from the empirical perspective.” 

The xorgens generator [Brent 2010] combines an F2-linear generator using four xorshift op- 
erations with a Weyl generator. The author furthermore suggests subjecting the output of the 
Weyl generator to a simple mixing function Ao.o © rotate(o, y) (for some constant y ~ w/2) before, 
rather than after, adding it to the output of the xorshift generator. 

Recently a number of interacting online blogs and projects have reported discovering improved 
mixing functions, as well as improved tools and techniques for discovering and testing them 
[Ettinger 2019; Evensen 2018, 2019, 2020; Mulvey 2016; Wellons 2018, 2019]; we speculate that such 
mixers might provide useful improvements when used in LXM algorithms. They observe that on 
64-bit words, for any 0 < a < b < 64, At.t © rotate(t, a) © rotate(t, b) is a bijective transformation. 


12,» CONCLUSIONS AND FUTURE WORK 


At the end of their paper, Steele, Lea, and Flood [2014] commented: “It would be a delightful outcome 
if, in the end, the best way to split off a new PRNG is indeed simply to ‘pick one at random: ” Perhaps 
we have now achieved that: our testing suggests that if the arguments to the LXM constructor are 
themselves chosen uniformly at random—with no need to filter out any “weak values” other than 
ensuring that the additive parameter a is odd and that the initial state of the XBG subgenerator 
is nonzero—then the interleaved outputs of two or more generators constructed in this way will 
pass the BigCrush test suite [L’Ecuyer and Simard 2007; Simard 2009] and also the PractRand test 
suite [Doty-Humphrey 2011-2021] with extremely high probability. 

The SpriTM1x algorithm used in JDK8 has 127 bits of state (of which 64 are updated per 64 bits 
generated) and uses 9 arithmetic operations per 64 bits generated [Steele, Lea, and Flood 2014]. The 
64-bit LXM algorithm L64X128, which has a 64-bit LCG and xoroshiro128 as subgenerators, uses 
255 bits of state (of which 192 are updated per 64 bits generated) and uses 17 arithmetic operations 
(or possibly 14, on architectures that allow operations on 32-bit halfwords of 64-bit registers) per 64 
bits generated (see Figure 1). Our timing measurements confirm that on contemporary architectures 
and using popular compilers, the basic generate operation for L64X128 is somewhat slower than 
that for SptrrMrx, but never by more than a factor of 2. For applications in which it is desired to 
have a significantly smaller probability of statistical correlations among multiple generators being 
used by parallel tasks, especially when it is desirable to create new generator instances on the fly 
(for example, when forking new threads), L64X128 may be very attractive. This instance of LXM, 
and several others, will be provided in JDK17 later in 2021 as part of a new RandomGenerator API 
designed to make it easier for applications to use a variety of PRNG algorithms interchangeably. 

Work yet to be done includes (1) exploration of even better mixing functions, (2) exploration 
of different congruential components, such as Marsaglia’s multiply-with-carry generators, and 
(3) even more thorough testing of (a) LXM generator combinations and (b) a simplified generator 
that consists only of an additive constant (or a Weyl generator), an XBG generator, a combining 
function, and a mixing function. 
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